Music is a widely shared form of communication, and many musicians compose music to convey a specific message to politicians and public figures.
It is fascinating to see what everyone else is listening to alongside your own playlist, and the Spotify API data we acquired from Canvas contains a wealth of information that allows us to determine the popularity of a song.
The goal of this project is to analyze and visualize data from the GlobalMusicData database. The data set includes detailed information about the performers, as well as their tracks, genres, and playlists. Since 1993, the GlobalMusicData data set has contained information on track names, albums, playlists, genres, and much more for various artists.
Following the steps below, we used R to analyze and visualize the data, to investigate and detect trends in the artists’ recordings, and to uncover insights.
For more information, see the Spotify API documentation.
library(readr) #will be used to read csv file
library(plotly) # will be used to make interactive, publication-quality graphs.
library(tidyr) # will be used to tidy up data
library(GGally) #extension of ggplot2 with functions
library(prettydoc) # used to document themes for R Markdown
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # used for date/time functions
library(magrittr) # used for piping
library(ggplot2) # used for data visualization
library(dplyr) # used for data manipulation
The code used to assess the variables in the raw data is as follows. We found that the data set has 32,833 observations and 23 variables, which are listed below.
# Importing the data
data <- read.csv("Global Music Data.csv", header = TRUE, sep = ",")
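The size of an imported data frame can be confirmed right after reading it. The following is a minimal sketch on a tiny made-up CSV (the rows and the `toy` name are hypothetical stand-ins, not values from the real file), using only base R:

```r
# Toy CSV text standing in for the real file (values are made up)
csv_text <- "track_name,track_popularity,playlist_genre
Song A,66,pop
Song B,70,rock
Song C,55,rap"

# read.csv() accepts inline text through the `text` argument
toy <- read.csv(text = csv_text, header = TRUE, sep = ",")

dim(toy)    # rows (observations) and columns (variables)
nrow(toy)   # observations only
ncol(toy)   # variables only
```

Running the same three calls on the imported data frame would show whether the reported numbers of observations and variables match what was actually read.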
Data cleaning is the process of correcting or eliminating incorrect, corrupted, improperly formatted, duplicate, or incomplete data within a dataset. When multiple data sources are merged, there are numerous ways for data to be duplicated or mislabeled.
#Computing summary statistics for the variables
datatable(
summary(data)
)
#Identifying the data types of each variable
#str() prints its result and returns NULL, so it is called directly rather than inside datatable()
str(data)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "14/6/2019" "13/12/2019" "5/7/2019" "19/7/2019" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
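The structure above shows `track_album_release_date` stored as character strings in day/month/year order. A minimal sketch of converting such strings to `Date` values with base R follows; the four sample strings are copied from the output above, while the variable names are hypothetical:

```r
# Sample values copied from the str() output above
release_chr <- c("14/6/2019", "13/12/2019", "5/7/2019", "19/7/2019")

# %d/%m/%Y = day / month / four-digit year;
# as.Date() returns NA for any string that does not match the format
release_date <- as.Date(release_chr, format = "%d/%m/%Y")

class(release_date)            # check that the result is a Date vector
format(release_date, "%Y")     # extract the release year as text
```

Counting the `NA` values after conversion is a simple way to spot entries that do not follow the assumed format.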
#Finding missing data
#Number of missing values in this data frame
sum(is.na(data))
## [1] 15
#Count the number of missing values per column
colSums(is.na(data))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
To work with a clean dataset, the collected data was cleaned.
Incomplete records can be removed from the analysis by passing the data frame or matrix to the na.omit() function. It is a quick way to drop rows that contain NA values in R.
#Remove missing data
#store new cleaned data to data1
data1 <- na.omit(data)
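A small sketch of what `na.omit()` does, on a made-up data frame (`toy` and its values are hypothetical). It also illustrates that the number of rows dropped can be smaller than the number of missing values when several of them fall in the same row:

```r
# Made-up data: the second row has two missing values
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", "Song D"),
  track_artist     = c("Artist 1", NA, "Artist 3", "Artist 4"),
  track_popularity = c(66L, 40L, 70L, 55L),
  stringsAsFactors = FALSE
)

toy_clean <- na.omit(toy)          # keep only rows with no NA in any column

sum(is.na(toy))                    # total missing values
sum(!complete.cases(toy))          # rows containing at least one NA
nrow(toy) - nrow(toy_clean)        # rows actually dropped
```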
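#Return the column names
#Note: names() is applied to the whole logical vector, so every column name is returned,
#whether or not the column has missing values
names((colSums(is.na(data))>0))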
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
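To list only the columns that do (or do not) contain missing values, the logical condition can be passed through `which()` before taking the names. This is a sketch on a made-up data frame; `toy`, `na_counts`, and the other names are hypothetical:

```r
# Made-up data: only track_name has a missing value
toy <- data.frame(
  track_id         = c("a1", "b2", "c3"),
  track_name       = c("Song A", NA, "Song C"),
  track_popularity = c(66L, 40L, 70L),
  stringsAsFactors = FALSE
)

na_counts <- colSums(is.na(toy))                 # missing values per column

# which() keeps only the positions where the condition is TRUE
cols_with_na    <- names(which(na_counts > 0))
cols_without_na <- names(which(na_counts == 0))

cols_with_na
cols_without_na
```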
# Read first 10 rows of the cleaned data set
datatable(head(data1, 10),options = list(scrollX=TRUE, pageLength=5))
# Read last 10 rows of the cleaned data set
datatable(tail(data1, 10),options = list(scrollX=TRUE, pageLength=5))
The scatterplot matrices below explore pairwise relationships among the numeric audio features:
pairs(~danceability+energy+key+loudness,data = data1,
main = "Scatterplot Matrix For GlobalMusicData")
1. Danceability, energy, key, loudness
From the scatter plots above:
Most tracks had danceability between 0.2 and 0.9, and they were fairly evenly spread across that range.
Energy values were concentrated toward the upper end of the scale, with very few tracks registering low energy levels.
From the loudness panels, we could tell that most tracks had high loudness values, with the exception of a few.
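Visual readings like these can be backed with a quick numeric check. The sketch below computes the share of values inside a range and a few quantiles; the first five danceability values come from the `str()` output earlier, the last three are made up, and the variable names are hypothetical:

```r
# First five values from the str() output; the last three are made up
danceability <- c(0.748, 0.726, 0.675, 0.718, 0.650, 0.15, 0.95, 0.40)

# Logical vector: TRUE where the value lies within [0.2, 0.9]
in_range <- danceability >= 0.2 & danceability <= 0.9

# The mean of a logical vector is the proportion of TRUE values
share_in_range <- mean(in_range)
share_in_range

# Quantiles summarize where the bulk of the values sits
quantile(danceability, probs = c(0.05, 0.5, 0.95))
```

Applied to the cleaned data, the same two calls would give the actual proportion of tracks in the stated range.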
pairs(~acousticness+liveness+tempo+instrumentalness,data = data1,
main = "Scatterplot Matrix For GlobalMusicData")
2. Acousticness, liveness, tempo, instrumentalness
From the scatter plots above:
Most tracks had acousticness densely concentrated between 0.0 and 0.4.
Liveness was concentrated between 0.0 and 0.4 and thinned out toward 0.8, so most tracks had average to below-average liveness.
In the instrumentalness panels, tracks fell toward two extremes: the population was densest around 0.8 and 0.0, so a track either had substantial instrumentalness or little to none.
pairs(~mode+speechiness+duration_ms,data = data1,
main = "Scatterplot Matrix For GlobalMusicData")
3. Mode, speechiness, duration_ms
From the scatter plots above:
Mode took only a few distinct values across the tracks.
Most tracks had low speechiness.
Most tracks were short in duration.
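A correlation matrix is a compact numeric companion to a scatterplot matrix. The following is a minimal sketch, assuming Pearson correlation is an acceptable summary here; the values are the first five rows from the `str()` output earlier, and `toy` and `cor_matrix` are hypothetical names:

```r
# First five rows of three numeric features, taken from the str() output
toy <- data.frame(
  energy       = c(0.916, 0.815, 0.931, 0.930, 0.833),
  loudness     = c(-2.63, -4.97, -3.43, -3.78, -4.67),
  danceability = c(0.748, 0.726, 0.675, 0.718, 0.650)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(toy, method = "pearson")

round(cor_matrix, 2)
```

With only five rows the coefficients are not meaningful on their own; the point is the call pattern, which works the same way on the numeric columns of the full data.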
ggplot(data1, aes(x = playlist_genre, y = track_popularity)) +
  # format y-axis labels with thousands separators
  scale_y_continuous(labels = scales::comma) +
  # customize bars; stat = "identity" stacks one segment per track,
  # so each bar's height is the summed popularity of its genre
  geom_bar(color = "black",
           fill = "pink",
           width = 0.5,
           stat = "identity") +
  # add value labels (one label per track)
  geom_text(aes(label = track_popularity),
            vjust = -0.25) +
  # customize x, y axes and title
  ggtitle("Track popularity by playlist genre") +
  xlab("Playlist genre") +
  ylab("Popularity of the Track") +
  # change font
  theme(plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        axis.title.x = element_text(color = "black", size = 11, face = "bold"),
        axis.title.y = element_text(color = "black", size = 11, face = "bold"))
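Because the chart is drawn from one row per track, summarizing popularity per genre first can make the comparison easier to read. This is a sketch of one possible approach using base R `aggregate()` on made-up values; `toy` and `genre_pop` are hypothetical names, and the choice of the mean as the summary is an assumption, not the method used in the chart above:

```r
# Made-up popularity scores for three genres
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "rap"),
  track_popularity = c(66, 70, 50, 60, 40)
)

# One row per genre, holding the mean popularity of its tracks
genre_pop <- aggregate(track_popularity ~ playlist_genre, data = toy, FUN = mean)

# Sort from most to least popular
genre_pop <- genre_pop[order(-genre_pop$track_popularity), ]
genre_pop
```

A summary table like this has one value per genre, so it could be passed to a bar chart that draws a single bar and a single label for each genre.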
The charts below show the popularity of music by genre, measured as the number of tracks in each genre and subgenre.
## Bar chart of track counts per genre
ggplot(data1, aes(x=playlist_genre)) +geom_bar()
## Bar chart of track counts per subgenre (flipped so the labels stay readable)
ggplot(data1, aes(x=playlist_subgenre)) +geom_bar() +coord_flip()
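The counts behind such bar charts can also be printed directly. A minimal base R sketch on a made-up genre vector follows (the values and the `genre_counts` name are hypothetical):

```r
# Made-up genre labels, one per track
playlist_genre <- c("pop", "rock", "pop", "edm", "rap", "pop", "edm")

# table() counts the tracks per genre; sort() orders them from largest to smallest
genre_counts <- sort(table(playlist_genre), decreasing = TRUE)
genre_counts

# prop.table() converts the counts to shares of the total
round(prop.table(genre_counts), 2)
```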
# Box plot of track duration by playlist genre
bp <- ggplot(data1, aes(x=duration_ms, y=playlist_genre, fill=playlist_genre)) +
geom_boxplot()+
labs(title="Plot of Duration against playlist genre",x="Duration (ms)", y = "Playlist genre")
bp + theme_classic()
Playlist subgenre vs duration:
# Box plot of track duration by playlist subgenre
bp <- ggplot(data1, aes(x=duration_ms, y=playlist_subgenre, fill=playlist_subgenre)) +
geom_boxplot()+
labs(title="Plot of Duration against playlist subgenre",x="Duration (ms)", y = "Playlist subgenre")
bp + theme_classic()
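The medians that a box plot draws can be computed per group as well, and converting milliseconds to minutes makes them easier to read. This is a sketch with four duration values taken from the `str()` output earlier; the genre assigned to each value is made up for illustration, and `toy` and `median_min` are hypothetical names:

```r
# Durations from the str() output; the genre labels are made up
toy <- data.frame(
  playlist_genre = c("pop", "pop", "rock", "rock"),
  duration_ms    = c(194754, 162600, 253040, 207619)
)

# 1 minute = 60,000 milliseconds
toy$duration_min <- toy$duration_ms / 60000

# Median duration (in minutes) within each genre
median_min <- tapply(toy$duration_min, toy$playlist_genre, median)
round(median_min, 2)
```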
In summary, the charts above indicate that rock, edm, rnb, and rap were the most popular genres in today’s music industry. Working with the cleaned data made the graph calculations easier and gave concise numerical summaries for the Global Music dataset. Summary computations and graphical visualizations such as bar charts, histograms, and box plots made it easier to address the problem statements.
From the analysis, we noted that edm, rock, and rnb were the most popular genres in today’s industry.
The tracks in the data preview shown earlier were all categorized in the pop genre.
There were a total of 15 missing values in the uncleaned dataset.
The dataset comprised 23 columns, each holding a different attribute.
Data Processing: R stores the objects it works with in physical memory, and compared with Python it tends to use more memory. R also requires all data to be held in one place, namely memory. As a result, it is not the best option when dealing with Big Data. However, this can be addressed with data management packages and Hadoop connectivity.
Safety and Security: R lacks basic security features that most programming languages, such as Python, include. As a result, R has a number of limitations, including being harder to incorporate in a web application.
Difficult Language: R is a difficult language to master. The learning curve is quite steep. owing to
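The memory held by an R object can be inspected directly, which helps when judging whether a dataset will fit comfortably in memory. The sketch below uses `object.size()` from the `utils` package on a randomly generated data frame; the data and the `toy` and `size_bytes` names are hypothetical:

```r
# Randomly generated stand-in data (1,000 rows, two columns)
toy <- data.frame(
  track_popularity = sample(0:100, 1000, replace = TRUE),
  danceability     = runif(1000)
)

# Estimated memory used by the object, in bytes
size_bytes <- object.size(toy)

print(size_bytes, units = "Kb")       # same value shown in kilobytes
format(size_bytes, units = "auto")    # unit chosen automatically
```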